Transductive Multi-view Zero-Shot Learning

Yanwei Fu, Timothy M. Hospedales, Tao Xiang and Shaogang Gong 


Abstract —Most existing zero-shot learning approaches exploit transfer learning via an intermediate semantic representation shared 
between an annotated auxiliary dataset and a target dataset with different classes and no annotation. A projection from a low-level 
feature space to the semantic representation space is learned from the auxiliary dataset and applied without adaptation to the 
target dataset. In this paper we identify two inherent limitations with these approaches. First, due to having disjoint and potentially 
unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target 
dataset/domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view 
embedding, to solve it. The second limitation is the prototype sparsity problem which refers to the fact that for each target class, only a 
single prototype is available for zero-shot learning given a semantic representation. To overcome this problem, a novel heterogeneous 
multi-view hypergraph label propagation method is formulated for zero-shot learning in the transductive embedding space. It effectively 
exploits the complementary information offered by different semantic representations and takes advantage of the manifold structures of 
multiple representation spaces in a coherent manner. We demonstrate through extensive experiments that the proposed approach 
(1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic 
representations, (3) significantly outperforms existing methods for both zero-shot and N-shot recognition on three image and video 
benchmark datasets, and (4) enables novel cross-view annotation tasks. 

Index Terms —Transductive learning, multi-view learning, transfer learning, zero-shot learning, heterogeneous hypergraph. 
1 Introduction 

Humans can distinguish 30,000 basic object classes [?] 
and many more subordinate ones (e.g. breeds of dogs). 
They can also create new categories dynamically from 
few examples or solely based on high-level description. 
In contrast, most existing computer vision techniques 
require hundreds of labelled samples for each object 
class in order to learn a recognition model. Inspired 
by humans' ability to recognise without seeing samples, 
and motivated by the prohibitive cost of training sample 
collection and annotation, the research area of learning to 
learn or lifelong learning [35], [?] has received increasing 
interest. These studies aim to intelligently apply previously 
learned knowledge to help future recognition 
tasks. In particular, a major and topical challenge in this 
area is to build recognition models capable of recognising 
novel visual categories without labelled training 
samples, i.e. zero-shot learning (ZSL). 

The key idea underpinning ZSL approaches is to 
exploit knowledge transfer via an intermediate-level 
semantic representation. Common semantic representations 
include binary vectors of visual attributes [27], [?], 
[?] (e.g. 'hasTail' in Fig. 1) and continuous word vectors 
[32], [11], [44] encoding linguistic context. In ZSL, two 
datasets with disjoint classes are considered: a labelled 
auxiliary set where a semantic representation is given for 
each data point, and a target dataset to be classified 
without any labelled samples. The semantic representation is 
assumed to be shared between the auxiliary/source and 
target/test dataset. It can thus be re-used for knowledge 
transfer between the source and target sets: a projection 
function mapping low-level features to the semantic 
representation is learned from the auxiliary data by a 
classifier or regressor. This projection is then applied to 
map each unlabelled target class instance into the same 
semantic space. In this space, a 'prototype' of each target 
class is specified, and each projected target instance is 
classified by measuring similarity to the class prototypes. 
Depending on the semantic space, the class prototype 
could be a binary attribute vector listing class properties 
(e.g., 'hasTail') [27] or a word vector describing the 
linguistic context of the textual class name [11]. 
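
The conventional pipeline described above (project an instance into the semantic space, then pick the most similar prototype) can be sketched in a few lines. This is only an illustrative sketch: the three attributes, the class names and all numbers are made up, Euclidean distance stands in for whichever similarity measure a given method uses, and the projected vector is assumed to have been produced already by a projection function learned on the auxiliary data.

```python
import math

def nearest_prototype(projection, prototypes):
    """Return the class whose prototype is closest to the projected instance."""
    return min(prototypes, key=lambda name: math.dist(projection, prototypes[name]))

# Hypothetical 3-attribute semantic space: [hasTail, isPink, livesInWater].
prototypes = {
    "pig": [1.0, 1.0, 0.0],
    "seal": [1.0, 0.0, 1.0],
}

# Hypothetical (noisy) semantic projection of one unlabelled target image.
projected_instance = [0.6, 0.7, 0.1]
print(nearest_prototype(projected_instance, prototypes))
```

Because every decision depends only on the distance to a single prototype per class, any systematic bias in the projections translates directly into classification errors.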

Two inherent problems exist in this conventional zero-shot 
learning approach. The first problem is the projection 
domain shift problem. Since the two datasets 
have different and potentially unrelated classes, the 
underlying data distributions of the classes differ, and so 
do the 'ideal' projection functions between the low-level 
feature space and the semantic spaces. Therefore, 
using the projection functions learned from the auxiliary 
dataset/domain without any adaptation to the 
target dataset/domain causes an unknown shift/bias. 
We call it the projection domain shift problem. This is 
illustrated in Fig. 1, which shows two object classes 
from the Animals with Attributes (AwA) dataset [28]: 
Zebra is one of the 40 auxiliary classes while Pig is 
one of 10 target classes. Both of them share the same 
'hasTail' semantic attribute, but the visual appearance 
of their tails differs greatly (Fig. 1(a)). Similarly, many 
other attributes of Pig are visually different from the 
corresponding attributes in the auxiliary classes. 
Figure 1(b) illustrates the projection domain shift problem 
by plotting (in 2D using t-SNE [37]) an 85D attribute 
space representation of the image feature projections and 
class prototypes (85D binary attribute vectors). A large 
discrepancy can be seen between the Pig prototype in 
the semantic attribute space and the projections of its 
class member instances, but not for the auxiliary Zebra 
class. This discrepancy is caused when the projections 
learned from the 40 auxiliary classes are applied directly 
to project the Pig instances - what 'hasTail' (as well 
as the other 84 attributes) visually means is different 
now. Such a discrepancy will inherently degrade the 
effectiveness of zero-shot recognition of the Pig class 
because the target class instances are classified according 
to their similarities/distances to those prototypes. To our 
knowledge, this problem has neither been identified nor 
addressed in the zero-shot learning literature. 

(a) visual space 

Figure 1. An illustration of the projection domain shift 
problem. Zero-shot prototypes are shown as red stars 
and predicted semantic attribute projections (defined in 
Sec. 3.2) are shown in blue. 

The second problem is the prototype sparsity problem: 
for each target class, we only have a single prototype, 
which is insufficient to fully represent what that 
class looks like. As shown in Figs. 1(b) and (c), there 
often exist large intra-class variations and inter-class 
similarities. Consequently, even if the single prototype 
is centred among its class instances in the semantic 
representation space, existing zero-shot classifiers will still 
struggle to assign correct class labels - one prototype per 
class is not enough to represent the intra-class variability 
or help disambiguate class overlap [?]. 

In addition to these two problems, conventional approaches 
to zero-shot learning are also limited in exploiting 
multiple intermediate semantic representations. 
Each representation (or semantic 'view') may contain 
complementary information - useful for distinguishing 
different classes in different ways. While both visual 
attributes [27], [?], [?], [?] and linguistic semantic 
representations such as word vectors [32], [11], [44] have 
been independently exploited successfully, it remains 
unattempted and non-trivial to synergistically exploit 
multiple semantic views. This is because they are often 
of very different dimensions and types and each suffers 
from different domain shift effects discussed above. 


In this paper, we propose to solve the projection 
domain shift problem using transductive multi-view 
embedding. The transductive setting means using the 
unlabelled test data to improve generalisation accuracy. 
In our framework, each unlabelled target class instance is 
represented by multiple views: its low-level feature view 
and its (biased) projections in multiple semantic spaces 
(visual attribute space and word space in this work). 
To rectify the projection domain shift between auxiliary 
and target datasets, we introduce a multi-view semantic 
space alignment process to correlate different semantic 
views and the low-level feature view by projecting them 
onto a common latent embedding space learned using 
multi-view Canonical Correlation Analysis (CCA) [?]. 
The intuition is that when the biased target data projections 
(semantic representations) are correlated/aligned 
with their (unbiased) low-level feature representations, 
the bias/projection domain shift is alleviated. The effects 
of this process on projection domain shift are illustrated 
by Fig. 1(c), where after alignment, the target Pig class 
prototype is much closer to its member points in this 
embedding space. Furthermore, after exploiting the 
complementarity of different low-level feature and semantic 
views synergistically in the common embedding space, 
different target classes become more compact and more 
separable (see Fig. 1(d) for an example), making the 
subsequent zero-shot recognition a much easier task. 

Even with the proposed transductive multi-view embedding 
framework, the prototype sparsity problem remains - 
instead of one prototype per class, a handful 
are now available depending on how many views are 
embedded, which are still sparse. Our solution is to pose 
this as a semi-supervised learning [57] problem: prototypes 
in each view are treated as labelled 'instances', 
and we exploit the manifold structure of the unlabelled 
data distribution in each view in the embedding space 
via label propagation on a graph. To this end, we introduce 
a novel transductive multi-view hypergraph label 
propagation (TMV-HLP) algorithm for recognition. The 
core of our TMV-HLP algorithm is a new distributed 
representation of graph structure termed heterogeneous 
hypergraph, which allows us to exploit the complementarity 
of different semantic and low-level feature views, 
as well as the manifold structure of the target data, to 
compensate for the impoverished supervision available 
from the sparse prototypes. Zero-shot learning is then 
performed by semi-supervised label propagation from 
the prototypes to the target data points within and across 
the graphs. The whole framework is illustrated in Fig. 2. 

By combining our transductive embedding framework 
and the TMV-HLP zero-shot recognition algorithm, our 
approach generalises seamlessly when none (zero-shot), 
or few (N-shot) samples of the target classes are available. 
Uniquely it can also synergistically exploit zero 
+ N-shot (i.e., both prototypes and labelled samples) 
learning. Furthermore, the proposed method enables a 
number of novel cross-view annotation tasks including 
zero-shot class description and zero prototype learning. 

Our contributions are as follows: (1) 
To our knowledge, this is the first attempt to investigate 
and provide a solution to the projection domain 
shift problem in zero-shot learning. (2) We propose a 
transductive multi-view embedding space that not only 
rectifies the projection shift, but also exploits the 
complementarity of multiple semantic representations of visual 
data. (3) A novel transductive multi-view heterogeneous 
hypergraph label propagation algorithm is developed to 
improve both zero-shot and N-shot learning tasks in the 
embedding space and overcome the prototype sparsity 
problem. (4) The learned embedding space enables a 
number of novel cross-view annotation tasks. Extensive 
experiments are carried out and the results show that our 
approach significantly outperforms existing methods for 
both zero-shot and N-shot recognition on three image 
and video benchmark datasets. 

2 Related Work 

Semantic spaces for zero-shot learning To address 
zero-shot learning, attribute-based semantic representations 
have been explored for images [?], [9] and to a 
lesser extent videos [?], [15]. Most existing studies [27], 
[?], [33], [34], [?], [54], [?] assume that an exhaustive 
ontology of attributes has been manually specified at 
either the class or instance level. However, annotating 
attributes scales poorly as ontologies tend to be domain 
specific. This is despite efforts exploring augmented 
data-driven/latent attributes at the expense of nameability 
[?], [31], [15]. To address this, semantic representations 
using existing ontologies and incidental data 
have been proposed [38], [37]. Recently, word vector 
approaches based on distributed language representations 
have gained popularity. In this case a word space is 
extracted from linguistic knowledge bases, e.g., Wikipedia, 
by natural language processing models such as [?], 
[32]. The language model is then used to project each 
class' textual name into this space. These projections can 
be used as prototypes for zero-shot learning [11], [44]. 
Importantly, regardless of the semantic spaces used, 
existing methods focus on either designing better semantic 
spaces or how to best learn the projections. The former 
is orthogonal to our work - any semantic spaces can be 
used in our framework and better ones would benefit 
our model. For the latter, no existing work has identified 
or addressed the projection domain shift problem. 
Transductive zero-shot learning was considered by Fu 
et al. [?], who introduced a generative model for 
user-defined and latent attributes. A simple transductive 
zero-shot learning algorithm was proposed: averaging the 
prototype's k-nearest neighbours to exploit the test data 
attribute distribution. Rohrbach et al. [36] proposed a 
more elaborate transductive strategy, using graph-based 
label propagation to exploit the manifold structure of 
the test data. These studies effectively transform the 
ZSL task into a transductive semi-supervised learning 
task [57] with prototypes providing the few labelled 
instances. Nevertheless, these studies and this paper (as 
with most previous work [28], [22], [32]) only consider 
recognition among the novel classes: unifying zero-shot 
with supervised learning remains an open challenge [44]. 
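
The simple transductive strategy mentioned above, replacing a prototype by the average of its k nearest test projections, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the cited implementation: the function name `refine_prototype`, the Euclidean distance and the toy numbers are all hypothetical.

```python
import math

def refine_prototype(prototype, projections, k):
    """Average the k test projections closest to the prototype.

    The refined prototype moves towards where the (possibly shifted)
    test data actually lie in the semantic space.
    """
    nearest = sorted(projections, key=lambda p: math.dist(p, prototype))[:k]
    return [sum(values) / len(nearest) for values in zip(*nearest)]

# Toy 2-D semantic space: three projections lie near the prototype, one is an outlier.
prototype = [1.0, 1.0]
projections = [[0.5, 0.5], [0.7, 0.5], [5.0, 5.0], [0.6, 0.8]]
print(refine_prototype(prototype, projections, k=3))
```

With k = 3 the outlier at (5, 5) is ignored and the refined prototype is approximately (0.6, 0.6), the mean of the three nearby projections.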
Domain adaptation Domain adaptation methods attempt 
to address the domain shift problems that occur 
when the assumption that the source and target 
instances are drawn from the same distribution is violated. 
Methods have been derived for both classification 
[?], [?] and regression [45], and both with [8] and 
without [?] requiring label information in the target 
task. Our zero-shot learning problem means that most 
supervised domain adaptation methods are inapplicable. 
Our projection domain shift problem differs from the 
conventional domain shift problems in that (i) it is 
indirectly observed in terms of the projection shift rather 
than the feature distribution shift, and (ii) the source 
domain classes and target domain classes are completely 
different and could even be unrelated. Consequently 
our domain adaptation method differs significantly from 
the existing unsupervised ones such as [10] in that our 
method relies on correlating different representations of 
the unlabelled target data in a multi-view embedding 
space. 

Learning multi-view embedding spaces Relating low-level 
feature and semantic views of data has been exploited 
in visual recognition and cross-modal retrieval. 
Most existing work [12], [?], [?], [?] focuses on 
modelling images/videos with associated text (e.g. tags 
on Flickr/YouTube). Multi-view CCA is often exploited 
to provide unsupervised fusion of different modalities. 
However, there are two fundamental differences between 
previous multi-view embedding work and ours: 
(1) Our embedding space is transductive, that is, learned 
from unlabelled target data from which all semantic 
views are estimated by projection rather than being the 
original views. These projected views thus have the 
projection domain shift problem that the previous work 
does not have. (2) The objectives are different: we aim 
to rectify the projection domain shift problem via the 
embedding in order to perform better recognition and 
annotation while previous studies target primarily 
cross-modal retrieval. Note that although in this work, the 
popular CCA model is adopted for multi-view embedding, 
other models [32], [?] could also be considered. 
Graph-based label propagation In most previous zero-shot 
learning studies (e.g., direct attribute prediction 
(DAP) [22]), the available knowledge (a single prototype 
per target class) is very limited. There has therefore been 
recent interest in additionally exploiting the unlabelled 
target data distribution by transductive learning [36], 
[32]. However, both [32] and [?] suffer from the projection 
domain shift problem, and are unable to effectively 
exploit multiple semantic representations/views. In contrast, 
after embedding, our framework synergistically 
integrates the low-level feature and semantic representations 
by transductive multi-view hypergraph label propagation 
(TMV-HLP). Moreover, TMV-HLP generalises 
beyond zero-shot to N-shot learning if labelled instances 
are available for the target classes. 

In a broader context, graph-based label propagation 
[55] in general, and classification on multi-view graphs 
(C-MG) in particular are well-studied in semi-supervised 
learning. Most C-MG solutions are based on the seminal 
work of Zhou et al. [?], which generalises spectral 
clustering from a single graph to multiple graphs by 
defining a mixture of random walks on multiple graphs. 
In the embedding space, instead of constructing local 
neighbourhood graphs for each view independently 
(e.g. TMV-BLP [?]), this paper proposes a distributed 
representation of pairwise similarity using heterogeneous 
hypergraphs. Such a distributed heterogeneous hypergraph 
representation can better explore the higher-order 
relations between any two nodes of different complementary 
views, and thus give rise to a more robust 
pairwise similarity graph and lead to better classification 
performance than previous multi-view graph methods 
[?], [14]. Hypergraphs have been used as an effective 
tool to align multiple data/feature modalities in data 
mining [29], multimedia [?] and computer vision [35], 
[19] applications. A hypergraph is the generalisation of 
a 2-graph with edges connecting many nodes/vertices, 
versus connecting two nodes in conventional 2-graphs. 
This makes it cope better with noisy nodes and thus 
achieve better performance than conventional graphs 
[21], [22], [12]. The only existing work considering 
hypergraphs for multi-view data modelling is [19]. Different 
from the multi-view hypergraphs proposed in [19], which 
are homogeneous, that is, constructed in each view 
independently, we construct a multi-view heterogeneous 
hypergraph: using the nodes from one view as query 
nodes to compute hyperedges in another view. This 
novel graph structure better exploits the complementarity 
of different views in the common embedding space. 

3 Learning a Transductive Multi-View 
Embedding Space 

A schematic overview of our framework is given in 
Fig. 2. We next introduce some notation and assumptions, 
followed by the details of how to map image 
features into each semantic space, and how to map 
multiple spaces into a common embedding space. 

3.1 Problem setup 

We have $c_S$ source/auxiliary classes with $n_S$ instances 
$S = \{X_S, z_S\}$ and $c_T$ target classes $T = \{X_T, z_T\}$ 
with $n_T$ instances. $X_S \in \mathbb{R}^{n_S \times t}$ and 
$X_T \in \mathbb{R}^{n_T \times t}$ denote 
the $t$-dimensional low-level feature vectors of auxiliary 
and target instances respectively. $z_S$ and $z_T$ are the 
auxiliary and target class label vectors. We assume the 
auxiliary and target classes are disjoint: $z_S \cap z_T = \emptyset$. We 
have $I$ different types of semantic representations; $Y_S^i$ 
and $Y_T^i$ represent the $i$-th type of $m_i$-dimensional semantic 
representation for the auxiliary and target datasets 
respectively; so $Y_S^i \in \mathbb{R}^{n_S \times m_i}$ and 
$Y_T^i \in \mathbb{R}^{n_T \times m_i}$. Note 
that for the auxiliary dataset, $Y_S^i$ is given as each data 
point is labelled. But for the target dataset, $Y_T^i$ is missing, 
and its prediction $\hat{Y}_T^i$ from $X_T$ is used instead. As we 
shall see, this is obtained using a projection function 
learned from the auxiliary dataset. The problem of zero-shot 
learning is to estimate $z_T$ given $X_T$ and $\hat{Y}_T^i$. 

Without any labelled data for the target classes, external 
knowledge is needed to represent what each target 
class looks like, in the form of class prototypes. Specifically, 
each target class $c$ has a pre-defined class-level 
semantic prototype $y_c^i$ in each semantic view $i$. In this 
paper, we consider two types of intermediate semantic 
representation (i.e. $I = 2$): attributes and word vectors, 
which represent two distinct and complementary 
sources of information. We use $\mathcal{X}$, $\mathcal{A}$ and $\mathcal{V}$ to denote 
the low-level feature, attribute and word vector spaces 
respectively. The attribute space $\mathcal{A}$ is typically manually 
defined using a standard ontology. For the word vector 
space $\mathcal{V}$, we employ the state-of-the-art skip-gram neural 
network model [32] trained on all English Wikipedia 
articles¹. Using this learned model, we can project the 
textual name of any class into the $\mathcal{V}$ space to get its word 
vector representation. Unlike semantic attributes, it is a 
'free' semantic representation in that this process does 
not need any human annotation. We next address how 
to project low-level features into these two spaces. 

3.2 Learning the projections of semantic spaces 

Mapping images and videos into semantic space $i$ requires 
a projection function $f^i : \mathcal{X} \to \mathcal{Y}^i$. This is typically 
realised by a classifier [27] or regressor [44]. In this 
paper, using the auxiliary set $S$, we train support vector 
classifiers $f^{\mathcal{A}}(\cdot)$ and support vector regressors $f^{\mathcal{V}}(\cdot)$ for 
each dimension² of the auxiliary class attribute and word 
vectors respectively. Then the target class instances $X_T$ 
have the semantic projections: $\hat{Y}_T^{\mathcal{A}} = f^{\mathcal{A}}(X_T)$ and 
$\hat{Y}_T^{\mathcal{V}} = f^{\mathcal{V}}(X_T)$. However, these predicted intermediate semantics 
have the projection domain shift problem illustrated 
in Fig. 1. To address this, we learn a transductive multi-view 
semantic embedding space to align the semantic 
projections with the low-level features of target data. 

3.3 Transductive multi-view embedding 

We introduce a multi-view semantic alignment (i.e. 
transductive multi-view embedding) process to correlate 
target instances in different (biased) semantic view 
projections with their low-level feature view. This process 
alleviates the projection domain shift problem, as well 
as providing a common space in which heterogeneous 
views can be directly compared, and their complementarity 
exploited (Sec. 4). To this end, we employ 
multi-view Canonical Correlation Analysis (CCA) for 
$n_V$ views, with the target data representation in view 
$i$ denoted $\Phi^i$, an $n_T \times m_i$ matrix. Specifically, we project 
three views of each target class instance, $f^{\mathcal{A}}(X_T)$, $f^{\mathcal{V}}(X_T)$ 
and $X_T$ (i.e. $n_V = I + 1 = 3$), into a shared embedding 
space. The three projection functions $W^i$ are learned by 

$$
\min_{\{W^i\}_{i=1}^{n_V}} \sum_{i,j=1}^{n_V} \mathrm{Trace}(W^i \Sigma_{ij} W^j)
= \sum_{i,j=1}^{n_V} \left\| \Phi^i W^i - \Phi^j W^j \right\|_F^2
$$
$$
\text{s.t.} \quad [W^i]^T \Sigma_{ii} W^i = I, \quad [w_k^i]^T \Sigma_{ij} w_l^j = 0,
\quad i, j = 1, \cdots, n_V, \quad k, l = 1, \cdots, n_T \qquad (1)
$$

where $W^i$ is the projection matrix which maps the view 
$\Phi^i$ ($\in \mathbb{R}^{n_T \times m_i}$) into the embedding space and $w_k^i$ is 
the $k$th column of $W^i$. $\Sigma_{ij}$ is the covariance matrix 
between $\Phi^i$ and $\Phi^j$. The optimisation problem above is 
multi-convex as long as $\Sigma_{ii}$ are non-singular. The local 
optimum can be easily found by iteratively maximising 
over each $W^i$ given the current values of the other 
coefficients as detailed in [18]. 

1. To 13 Feb. 2014, it includes 2.9 billion words from a 4.33 million-word 
vocabulary (single and bi/tri-gram words). 

2. Note that methods for learning projection functions for all dimensions 
jointly exist (e.g. [11]) and can be adopted in our framework. 

Figure 2. The pipeline of our framework illustrated on the task of classifying unlabelled target data into two classes. 

The dimensionality $m_e$ of the embedding space is the 
sum of the input view dimensions, i.e. $m_e = \sum_{i=1}^{n_V} m_i$, 
so $W^i \in \mathbb{R}^{m_i \times m_e}$. Compared to the classic approach to 
CCA [?], which projects to a lower dimension space, this 
retains all the input information including uncorrelated 
dimensions which may be valuable and complementary. 
Side-stepping the task of explicitly selecting a subspace 
dimension, we use a more stable and effective soft-weighting 
strategy to implicitly emphasise significant 
dimensions in the embedding space. This can be seen 
as a generalisation of standard dimension reducing 
approaches to CCA, which implicitly define a binary 
weight vector that activates a subset of dimensions and 
deactivates others. Since the importance of each dimension 
is reflected by its corresponding eigenvalue [18], 
[?], we use the eigenvalues to weight the dimensions 
and define a weighted embedding space $\Gamma$: 

$$
\Psi^i = \Phi^i W^i [D^i]^{\lambda} = \Phi^i W^i \tilde{D}^i, \qquad (2)
$$

where $D^i$ is a diagonal matrix with its diagonal elements 
set to the eigenvalues of each dimension in the embedding 
space, $\lambda$ is a power weight of $D^i$, empirically 
set to 4 [?], and $\Psi^i$ is the final representation of the 
target data from view $i$ in $\Gamma$. We index the $n_V = 3$ views 
as $i \in \{\mathcal{X}, \mathcal{V}, \mathcal{A}\}$ for notational convenience. The same 
formulation can be used if more views are available. 
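
To make the contrast between soft weighting and conventional dimension selection concrete, the following sketch applies both to a single embedded vector. It is only an illustration under stated assumptions: the eigenvalues and the vector are made-up numbers, the function names are hypothetical, and the CCA projection that would produce them is not shown.

```python
def soft_weight(embedded, eigenvalues, power=4.0):
    """Scale every embedding dimension by its eigenvalue raised to `power`."""
    return [x * (ev ** power) for x, ev in zip(embedded, eigenvalues)]

def hard_select(embedded, eigenvalues, keep):
    """Binary weighting: keep the `keep` dimensions with the largest eigenvalues."""
    ranked = sorted(range(len(eigenvalues)), key=lambda d: eigenvalues[d], reverse=True)
    kept = set(ranked[:keep])
    return [x if d in kept else 0.0 for d, x in enumerate(embedded)]

# Made-up eigenvalues for a 3-dimensional embedding and one embedded instance.
eigenvalues = [0.9, 0.5, 0.1]
embedded = [1.0, 1.0, 1.0]

print(soft_weight(embedded, eigenvalues))         # graded emphasis on every dimension
print(hard_select(embedded, eigenvalues, keep=2))  # weakest dimension switched off
```

Soft weighting keeps every dimension but lets weakly correlated ones contribute very little, whereas hard selection requires choosing how many dimensions to keep and discards the rest entirely.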
Similarity in the embedding space The choice of 
similarity metric is important for high-dimensional embedding 
spaces. For the subsequent recognition and 
annotation tasks, we compute cosine distance in $\Gamma$ by $l_2$ 
normalisation: normalising any vector $\psi_k^i$ (the $k$-th row 
of $\Psi^i$) to unit length (i.e. $\|\psi_k^i\|_2 = 1$). Cosine similarity 
is then given by the inner product of any two vectors in $\Gamma$. 
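
The equivalence used here, that the inner product of unit-length vectors equals their cosine similarity, can be checked with a short sketch. The vectors below are arbitrary toy values and the helper names are hypothetical.

```python
import math

def l2_normalise(vector):
    """Rescale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def inner(u, v):
    """Plain inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

a = l2_normalise([3.0, 4.0])    # -> [0.6, 0.8]
b = l2_normalise([40.0, 30.0])  # -> [0.8, 0.6], the original scale is irrelevant
print(inner(a, b))              # cosine similarity of the original vectors
```

Normalising once up front means every later similarity computation reduces to a dot product, and vectors that differ only in magnitude are treated as identical.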

4 Recognition by Multi-view Hypergraph 
Label Propagation 

For zero-shot recognition, each target class $c$ to be 
recognised has a semantic prototype $y_c^i$ in each view 
$i$. Similarly, we have three views of each unlabelled 
instance: $f^{\mathcal{A}}(X_T)$, $f^{\mathcal{V}}(X_T)$ and $X_T$. The class prototypes 
are expected to be the mean of the distribution of their 
class in semantic space, since the projection function 
$f^i$ is trained to map instances to their class prototype 
in each semantic view. To exploit the learned space $\Gamma$ 
to improve recognition, we project both the unlabelled 
instances and the prototypes into the embedding space³. 
The prototypes $y_c^i$ for views $i \in \{\mathcal{A}, \mathcal{V}\}$ are projected 
as $\psi_c^i = y_c^i W^i \tilde{D}^i$. So we have $\psi_c^{\mathcal{A}}$ and $\psi_c^{\mathcal{V}}$ for the 
attribute and word vector prototypes of each target class 
$c$ in $\Gamma$. In the absence of a prototype for the 
(non-semantic) low-level feature view $\mathcal{X}$, we synthesise it as 
$\psi_c^{\mathcal{X}} = (\psi_c^{\mathcal{A}} + \psi_c^{\mathcal{V}})/2$. If labelled data is available (i.e., 
the N-shot case), these are also projected into the space. Recognition 
could now be achieved using NN classification 
with the embedded prototypes/N-shots as labelled data. 
However, this does not effectively exploit the multi-view 
complementarity, and suffers from labelled data (prototype) 
sparsity. To solve this problem, we next introduce a 
unified framework to fuse the views and transductively 
exploit the manifold structure of the unlabelled target 
data to perform zero-shot and N-shot learning. 

3. Before being projected into $\Gamma$, the prototypes are updated by the 
semi-latent zero-shot learning algorithm in [15]. 

Most or all of the target instances are unlabelled, so 
classification based on the sparse prototypes is effectively 
a semi-supervised learning problem [57]. We leverage 
graph-based semi-supervised learning to exploit the 
manifold structure of the unlabelled data transductively 
for classification. This differs from the conventional 
approaches such as direct attribute prediction (DAP) [?] 
or NN, which too simplistically assume that the data 
distribution for each target class is Gaussian or multinomial. 
However, since our embedding space contains 
multiple projections of the target data and prototypes, 
it is hard to define a single graph that synergistically 
exploits the manifold structure of all views. We therefore 
construct multiple graphs within and across views in a 
transductive multi-view hypergraph label propagation 
(TMV-HLP) model. Specifically, we construct the 
heterogeneous hypergraphs across views to combine/align 
the different manifold structures so as to enhance the 
robustness and exploit the complementarity of different 
views. Semi-supervised learning is then performed by 
propagating the labels from the sparse prototypes 
(zero-shot) and/or the few labelled target instances (N-shot) 
to the unlabelled data using random walk on the graphs. 
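
As background for the propagation step, the sketch below runs a generic random-walk label propagation on a single small graph. It is not the TMV-HLP algorithm: it uses one ordinary graph instead of multiple heterogeneous hypergraphs, and the update rule, the damping factor `alpha`, the iteration count and the toy edge weights are all assumptions chosen for illustration.

```python
def propagate_labels(weights, initial, alpha=0.8, iterations=50):
    """Spread label scores over a graph by repeated random-walk steps.

    weights: symmetric matrix of non-negative edge weights.
    initial: one score row per node; labelled nodes are one-hot, others all zero.
    Returns the index of the highest-scoring class for every node.
    """
    n, n_classes = len(weights), len(initial[0])
    # Row-normalise the weights to obtain random-walk transition probabilities.
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total if total > 0 else 0.0 for w in row])
    scores = [list(row) for row in initial]
    for _ in range(iterations):
        updated = []
        for k in range(n):
            spread = [sum(transition[k][l] * scores[l][c] for l in range(n))
                      for c in range(n_classes)]
            # Mix the propagated scores with the original label information.
            updated.append([alpha * s + (1 - alpha) * y
                            for s, y in zip(spread, initial[k])])
        scores = updated
    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]

# Two tight clusters {0, 1, 2} and {3, 4, 5} joined by one weak edge (2-5).
weights = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0.1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0.1, 1, 1, 0],
]
# Nodes 0 and 3 play the role of prototypes for classes 0 and 1.
initial = [[1, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0]]
print(propagate_labels(weights, initial))
```

Only two nodes carry labels, yet every node in each cluster ends up with the label of its cluster's prototype, which is the effect the sparse prototypes rely on.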


4.1 Constructing heterogeneous hypergraphs 

Pairwise node similarity The key idea behind a 
hypergraph-based method is to group similar data points, 
represented as vertices/nodes on a graph, into hyperedges, 
so that the subsequent computation is less sensitive to 
individual noisy nodes. With the hyperedges, the pairwise 
similarity between two data points is measured as the 
similarity between the two hyperedges that they belong 
to, instead of that between the two nodes only. For 
both forming hyperedges and computing the similarity 
between two hyperedges, pairwise similarity between 
two graph nodes needs to be defined. In our embedding 
space $\Gamma$, each data point in each view defines a node, 
and the similarity between any pair of nodes is: 

$$
\omega(\psi_k^i, \psi_l^j) = \exp\left( \frac{\langle \psi_k^i, \psi_l^j \rangle^2}{\varpi} \right) \qquad (3)
$$

where $\langle \psi_k^i, \psi_l^j \rangle^2$ is the square of inner product 
between the $i$th and $j$th projections of nodes $k$ and $l$ with 
a bandwidth parameter $\varpi$⁴. Note that Eq (3) defines the 
pairwise similarity between any two nodes within the 
same view ($i = j$) or across different views ($i \neq j$). 
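
A small numerical sketch of this similarity follows. It assumes an exponential kernel over the squared inner product divided by the bandwidth, and sets the bandwidth to the median squared inner product over all node pairs, the strategy described in footnote 4. The kernel form, the function names and the toy unit vectors are assumptions for illustration only.

```python
import math
from statistics import median

def squared_inner(u, v):
    """Square of the inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v)) ** 2

def similarity_matrix(view_i, view_j):
    """Similarity of every node in view i to every node in view j.

    The bandwidth is the median squared inner product, so roughly half of
    the pairs fall on either side of exp(1). It must be non-zero.
    """
    squared = [[squared_inner(u, v) for v in view_j] for u in view_i]
    bandwidth = median(x for row in squared for x in row)
    return [[math.exp(x / bandwidth) for x in row] for row in squared]

# Two unit-length nodes per view (toy values).
view_i = [[1.0, 0.0], [0.0, 1.0]]
view_j = [[0.8, 0.6], [0.6, 0.8]]
for row in similarity_matrix(view_i, view_j):
    print(row)
```

Passing the same list for both arguments gives within-view similarities, while two different views give the cross-view case.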
Heterogeneous hyperedges Given the multi-view projections 
of the target data, we aim to construct a set of 
across-view heterogeneous hypergraphs 

$$
\mathcal{G}^c = \left\{ \mathcal{G}^{ij} \mid i, j \in \{\mathcal{X}, \mathcal{V}, \mathcal{A}\},\ i \neq j \right\} \qquad (4)
$$

where $\mathcal{G}^{ij} = \{\Psi^i, E^{ij}, \Omega^{ij}\}$ denotes the cross-view 
heterogeneous hypergraph from view $i$ to $j$ (in that 
order) and $\Psi^i$ is the node set in view $i$; $E^{ij}$ is the 
hyperedge set and $\Omega^{ij}$ is the pairwise node similarity 
set for the hyperedges. Specifically, we have the 
hyperedge set $E^{ij} = \{ e^{ij}_{\psi_k^i} \mid i \neq j,\ k = 1, \cdots, n_T + c_T \}$ 
where each hyperedge $e^{ij}_{\psi_k^i}$ includes the nodes⁵ 
in view $j$ that are the most similar to node 
$\psi_k^i$ in view $i$, and the similarity set 
$\Omega^{ij} = \{ \Delta^{ij}_{\psi_k^i} = \{ \omega(\psi_k^i, \psi_l^j) \} \mid i \neq j,\ \psi_l^j \in e^{ij}_{\psi_k^i},\ k = 1, \cdots, n_T + c_T \}$ 
where $\omega(\psi_k^i, \psi_l^j)$ is computed using Eq (3). 

We call $\psi_k^i$ the query node for hyperedge $e^{ij}_{\psi_k^i}$, since the 
hyperedge $e^{ij}_{\psi_k^i}$ intrinsically groups all nodes in view $j$ 
that are most similar to node $\psi_k^i$ in view $i$. Similarly, $\mathcal{G}^{ji}$ 
can be constructed by using nodes from view $j$ to query 
nodes in view $i$. Therefore given three views, we have six 
across view/heterogeneous hypergraphs. Figure 3 illustrates 
two heterogeneous hypergraphs constructed from 
two views. Interestingly, our way of defining hyperedges 
naturally corresponds to the star expansion [46] where 
the query node (i.e. $\psi_k^i$) is introduced to connect each 
node in the hyperedge $e^{ij}_{\psi_k^i}$. 

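Similarity strength of hyperedge For each hyperedge 
$e^{ij}_{\psi_k^i}$, we measure its similarity strength by using its query 
node $\psi_k^i$. Specifically, we use the weight $\delta^{ij}_{\psi_k^i}$ to indicate 
the similarity strength of nodes connected within the 
hyperedge $e^{ij}_{\psi_k^i}$. Thus, we define $\delta^{ij}_{\psi_k^i}$ based on the mean 
similarity of the set $\Delta^{ij}_{\psi_k^i}$ for the hyperedge 

$$
\delta^{ij}_{\psi_k^i} = \frac{1}{|e^{ij}_{\psi_k^i}|} \sum_{\psi_l^j \in e^{ij}_{\psi_k^i}} \omega(\psi_k^i, \psi_l^j), \qquad (5)
$$

where $|e^{ij}_{\psi_k^i}|$ is the cardinality of hyperedge $e^{ij}_{\psi_k^i}$. 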
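
The construction of heterogeneous hyperedges and their weights can be sketched as follows: every node of one view queries another view, the most similar nodes there form its hyperedge, and the hyperedge weight is the mean similarity between the query and its members. This is a simplified illustration under stated assumptions. The exponential similarity with a fixed bandwidth, the fixed hyperedge size, the function names and the toy coordinates are all hypothetical, and the normalisation steps described next are omitted.

```python
import math
from itertools import permutations

def inner(u, v):
    """Plain inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def build_hyperedges(view_i, view_j, size, bandwidth=1.0):
    """Query view j with every node of view i.

    Returns one (member_indices, weight) pair per query node: the members
    are the `size` nodes of view j most similar to the query, and the
    weight is the mean query-to-member similarity.
    """
    hyperedges = []
    for query in view_i:
        sims = [math.exp(inner(query, node) ** 2 / bandwidth) for node in view_j]
        members = sorted(range(len(view_j)), key=lambda l: sims[l], reverse=True)[:size]
        weight = sum(sims[l] for l in members) / len(members)
        hyperedges.append((members, weight))
    return hyperedges

# Toy embedded nodes: the same four data points seen in two views.
view_i = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
view_j = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]

edges_ij = build_hyperedges(view_i, view_j, size=2)
print([members for members, _ in edges_ij])

# Three views give six ordered (query view, retrieved view) pairs.
print(len(list(permutations(["X", "V", "A"], 2))))
```

Swapping the two arguments yields the hypergraph in the opposite direction, which is why the ordered pairs of three views produce six hypergraphs rather than three.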

4. Most previous work [?], [?] sets $\varpi$ by cross-validation. Inspired by [26], a simpler strategy for setting $\varpi$ is adopted: $\varpi \approx \operatorname{median}_{k, l = 1, \cdots, n} \langle \psi_k^i, \psi_l^j \rangle^2$, in order to have roughly the same number of similar and dissimilar sample pairs. This makes the edge weights from different pairs of nodes more comparable.

5. Both the unlabelled samples and the prototypes are nodes.

Figure 3. An example of constructing heterogeneous hypergraphs. Suppose in the embedding space, we have 14 nodes belonging to 7 data points A, B, C, D, E, F and G of two views - view $i$ (rectangle) and view $j$ (circle). Data points A, B, C and D, E, F, G belong to two different classes - red and green respectively. The multi-view semantic embedding maximises the correlations (connected by black dashed lines) between the two views of the same node. Two hypergraphs are shown ($\mathcal{G}^{ij}$ at the left and $\mathcal{G}^{ji}$ at the right) with the heterogeneous hyperedges drawn as red/green dashed ovals for the nodes of the red/green classes. Each hyperedge consists of the two nodes most similar to the query node.

In the embedding space $\Gamma$, the similarity sets $\Delta^{ij}_{\psi_k^i}$ and $\Delta^{ij}_{\psi_l^i}$ can be compared. Nevertheless, these sets come from heterogeneous views and have varying scales. Thus some normalisation steps are necessary to make the two similarity sets more comparable and the subsequent computation more robust. Specifically, we extend z-score (zero-mean, unit-variance) normalisation to the similarity sets: (a) We assume that every $\Delta^{ij}_{\psi_k^i} \in \Omega^{ij}$ should follow a Gaussian distribution. Thus, we enforce z-score normalisation on $\Delta^{ij}_{\psi_k^i}$. (b) We further assume that the retrieved similarity set between all the queried nodes $\psi_l^i$ ($l = 1, \cdots, n_T$) from view $i$ and $\psi_k^j$ should also follow a Gaussian distribution. So we again enforce a Gaussian distribution on the pairwise similarities between $\psi_k^j$ and all query nodes from view $i$ by z-score normalisation. (c) We select the first $K$ highest values from $\Delta^{ij}_{\psi_k^i}$ as the new similarity set $\bar{\Delta}^{ij}_{\psi_k^i}$ for hyperedge $e^{ij}_{\psi_k^i}$. $\bar{\Delta}^{ij}_{\psi_k^i}$ is then used in Eq (5) in place of $\Delta^{ij}_{\psi_k^i}$. These normalisation steps aim to compute a more robust similarity between each pair of hyperedges.
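
The three normalisation steps (a)-(c) can be sketched on a query-by-node similarity matrix as below. This is one possible reading, assuming the two standardisations are applied one after the other (first per query node, then per queried node); the helper names `zscore` and `normalise_and_select` are hypothetical.

```python
import numpy as np

def zscore(values, axis):
    """Standardise to zero mean and unit variance along `axis`."""
    mean = values.mean(axis=axis, keepdims=True)
    std = values.std(axis=axis, keepdims=True)
    return (values - mean) / np.where(std > 0, std, 1.0)

def normalise_and_select(omega_ij, k):
    """Steps (a)-(c) applied to a query-by-node similarity matrix."""
    per_query = zscore(omega_ij, axis=1)              # (a) within each query's set
    per_node = zscore(per_query, axis=0)              # (b) across all query nodes
    members = np.argsort(-per_node, axis=1)[:, :k]    # (c) keep the K highest
    return members, np.take_along_axis(per_node, members, axis=1)

rng = np.random.default_rng(1)
omega_ij = rng.random((8, 8))
members, selected = normalise_and_select(omega_ij, k=3)
```

The selected values are returned in descending order for each query node, so the first column always holds the strongest match.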
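Computing similarity between hyperedges. With the hypergraph, the similarity between two nodes is computed using their hyperedges $e^{ij}_{\psi_k^i}$. Specifically, for each hyperedge there is an associated incidence matrix $H^{ij} = \left[ h(\psi_l^j, e^{ij}_{\psi_k^i}) \right]_{(n_T + c_T) \times |E^{ij}|}$, where

$$h(\psi_l^j, e^{ij}_{\psi_k^i}) = \begin{cases} 1 & \text{if } \psi_l^j \in e^{ij}_{\psi_k^i} \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

To take into consideration the similarity strength between hyperedge and query node, we extend the binary valued hyperedge incidence matrix $H^{ij}$ to the soft-assigned incidence matrix $SH^{ij} = \left[ sh(\psi_l^j, e^{ij}_{\psi_k^i}) \right]_{(n_T + c_T) \times |E^{ij}|}$ as follows:

$$sh(\psi_l^j, e^{ij}_{\psi_k^i}) = \delta_{e^{ij}_{\psi_k^i}} \cdot \omega(\psi_k^i, \psi_l^j) \cdot h(\psi_l^j, e^{ij}_{\psi_k^i}) \qquad (7)$$

This soft-assigned incidence matrix is the product of three components: (1) the weight $\delta_{e^{ij}_{\psi_k^i}}$ for hyperedge $e^{ij}_{\psi_k^i}$; (2) the pairwise similarity computed using the queried node $\psi_k^i$; (3) the binary valued hyperedge incidence matrix element $h(\psi_l^j, e^{ij}_{\psi_k^i})$. To make the values of $SH^{ij}$ comparable among the different heterogeneous views, we apply $\ell_2$ normalisation to the soft-assigned incidence matrix values for all nodes incident to each hyperedge.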


Now for each heterogeneous hypergraph, we can finally define the pairwise similarity between any two nodes or hyperedges. Specifically for $\mathcal{G}^{ij}$, the similarity between the $o$-th and $l$-th nodes is

$$\omega_c^{ij}(\psi_o^j, \psi_l^j) = \sum_{e^{ij}_{\psi_k^i} \in E^{ij}} sh(\psi_o^j, e^{ij}_{\psi_k^i}) \cdot sh(\psi_l^j, e^{ij}_{\psi_k^i}). \qquad (8)$$

With this pairwise hyperedge similarity, the hypergraph definition is now complete. Empirically, given a node, other nodes on the graph that have very low similarities will have very limited effects on its label. Thus, to reduce computational cost, we only use the K-nearest-neighbour (KNN) (footnote 6) nodes of each node [57] for the subsequent label propagation step.
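
The soft-assigned incidence matrix of Eq (7) and the hyperedge-based node similarity of Eq (8) can be sketched as follows. The sketch assumes hyperedges are given as index arrays with their similarities and strengths, and omits the final KNN sparsification; `soft_incidence` and `hyperedge_similarity` are hypothetical names.

```python
import numpy as np

def soft_incidence(members, sims, strength, n_nodes):
    """Soft-assigned incidence matrix, one column per hyperedge (Eq (7))."""
    n_edges, size = members.shape
    sh = np.zeros((n_nodes, n_edges))
    edge_ids = np.repeat(np.arange(n_edges), size)
    # strength * similarity for member nodes; non-members stay at 0 (h = 0)
    sh[members.ravel(), edge_ids] = (strength[:, None] * sims).ravel()
    norms = np.linalg.norm(sh, axis=0, keepdims=True)
    return sh / np.where(norms > 0, norms, 1.0)   # l2-normalise each hyperedge

def hyperedge_similarity(sh):
    """Eq (8): sum over hyperedges of products of soft incidences."""
    return sh @ sh.T

members = np.array([[1, 2], [0, 2]])       # two hyperedges over three nodes
sims = np.array([[0.9, 0.5], [0.7, 0.3]])
strength = sims.mean(axis=1)
sh = soft_incidence(members, sims, strength, n_nodes=3)
similarity = hyperedge_similarity(sh)
```

In this toy case nodes 0 and 1 share no hyperedge, so their similarity is zero, while node 2 belongs to both hyperedges and is connected to each of them.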

The advantages of heterogeneous hypergraphs. We argue that the pairwise similarity of heterogeneous hypergraphs is a distributed representation [?]. To explain it, we can use star expansion [46] to extend a hypergraph into a 2-graph. For each hyperedge $e^{ij}_{\psi_k^i}$, the query node $\psi_k^i$ is used to compute the pairwise similarity $\Delta^{ij}_{\psi_k^i}$ of all the nodes in view $j$. Each hyperedge can thus define a hyperplane by categorising the nodes in view $j$ into two groups: a strong and a weak similarity group with respect to the query node $\psi_k^i$. In other words, the hyperedge set $E^{ij}$ is a multi-clustering with linearly separated regions (by each hyperplane) per class. Since the final pairwise similarity in Eq (8) can be represented by a set of similarity weights computed by hyperedge, and such weights are not mutually exclusive and are statistically independent, we consider the heterogeneous hypergraph a distributed representation. The advantage of having a distributed representation has been studied by Watts and Strogatz [52], [53], which shows that such a representation gives rise to better convergence rates and better clustering abilities. In contrast, the homogeneous hypergraphs adopted by previous work [22], [?], [19] do not have this property, which makes them less robust against noise. In addition, fusing different views in the early stage of graph construction can potentially lead to better exploitation of the complementarity of different views. However, it is worth pointing out that (1) the reason we can query nodes across views to construct heterogeneous hypergraphs is that we have projected all views into the same embedding space in the first place. (2) Hypergraphs typically gain robustness at the cost of losing discriminative power - they essentially blur the boundary of different clusters/classes by taking the average over hyperedges. A typical solution is to fuse hypergraphs with 2-graphs [12], [?], [29], which we adopt here as well.

6. K = 30. It can be varied from 10 to 50 with little effect in our experiments.


4.2 Label propagation by random walk 

Now we have two types of graphs: heterogeneous hypergraphs $\mathcal{G}^c = \{\mathcal{G}^{ij}\}$ and 2-graphs (footnote 7) $\mathcal{G}^p = \{\mathcal{G}^i\}$. Given three views ($n_V = 3$), we thus have nine graphs in total (six hypergraphs and three 2-graphs). To classify the unlabelled nodes, we need to propagate label information from the prototype nodes across the graph. Such semi-supervised label propagation [53], [57] has a closed-form solution and is explained as a random walk. A random walk requires the pairwise transition probability for nodes $k$ and $l$. We obtain this by aggregating the information from all graphs $\mathcal{G} = \{\mathcal{G}^p; \mathcal{G}^c\}$,


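$$p(k \to l) = \sum_{i \in \{\mathcal{X}, \mathcal{V}, \mathcal{A}\}} p(k \to l \mid \mathcal{G}^i) \cdot p(\mathcal{G}^i \mid k) + \sum_{i, j \in \{\mathcal{X}, \mathcal{V}, \mathcal{A}\},\ i \neq j} p(k \to l \mid \mathcal{G}^{ij}) \cdot p(\mathcal{G}^{ij} \mid k) \qquad (9)$$

where

$$p(k \to l \mid \mathcal{G}^i) = \frac{\omega_p^i(\psi_k^i, \psi_l^i)}{\sum_o \omega_p^i(\psi_k^i, \psi_o^i)} \quad \text{and} \quad p(k \to l \mid \mathcal{G}^{ij}) = \frac{\omega_c^{ij}(\psi_k^j, \psi_l^j)}{\sum_o \omega_c^{ij}(\psi_k^j, \psi_o^j)} \qquad (10)$$

with $\omega_p^i$ the pairwise node similarity of the 2-graph $\mathcal{G}^i$ and $\omega_c^{ij}$ the hyperedge-based similarity of Eq (8), and then the posterior probabilities of choosing graph $\mathcal{G}^i$ or $\mathcal{G}^{ij}$ at projection/node $k$ will be:

$$p(\mathcal{G}^i \mid k) = \frac{\pi(k \mid \mathcal{G}^i)\, p(\mathcal{G}^i)}{\sum_i \pi(k \mid \mathcal{G}^i)\, p(\mathcal{G}^i) + \sum_{ij} \pi(k \mid \mathcal{G}^{ij})\, p(\mathcal{G}^{ij})} \qquad (11)$$

$$p(\mathcal{G}^{ij} \mid k) = \frac{\pi(k \mid \mathcal{G}^{ij})\, p(\mathcal{G}^{ij})}{\sum_i \pi(k \mid \mathcal{G}^i)\, p(\mathcal{G}^i) + \sum_{ij} \pi(k \mid \mathcal{G}^{ij})\, p(\mathcal{G}^{ij})} \qquad (12)$$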


where $p(\mathcal{G}^i)$ and $p(\mathcal{G}^{ij})$ are the prior probabilities of graphs $\mathcal{G}^i$ and $\mathcal{G}^{ij}$ in the random walk. This probability expresses the prior expectation about the informativeness of each graph. The same Bayesian model averaging [14] can be used here to estimate these prior probabilities. However, the computational cost increases combinatorially with the number of views, and it turns out that the prior is not critical to the results of our framework. Therefore, a uniform prior is used in our experiments.


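The stationary probabilities for node $k$ in $\mathcal{G}^i$ and $\mathcal{G}^{ij}$ are

$$\pi(k \mid \mathcal{G}^i) = \frac{\sum_l \omega_p^i(\psi_k^i, \psi_l^i)}{\sum_o \sum_l \omega_p^i(\psi_o^i, \psi_l^i)} \qquad (13)$$

$$\pi(k \mid \mathcal{G}^{ij}) = \frac{\sum_l \omega_c^{ij}(\psi_k^j, \psi_l^j)}{\sum_o \sum_l \omega_c^{ij}(\psi_o^j, \psi_l^j)} \qquad (14)$$

Finally, the stationary probability across the multi-view hypergraph is computed as:

$$\pi(k) = \sum_{i \in \{\mathcal{X}, \mathcal{V}, \mathcal{A}\}} \pi(k \mid \mathcal{G}^i) \cdot p(\mathcal{G}^i) + \qquad (15)$$

$$\sum_{i, j \in \{\mathcal{X}, \mathcal{V}, \mathcal{A}\},\ i \neq j} \pi(k \mid \mathcal{G}^{ij}) \cdot p(\mathcal{G}^{ij}) \qquad (16)$$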
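
The aggregation of transition and stationary probabilities over several graphs can be sketched as follows. The sketch assumes each graph is given as a symmetric non-negative similarity matrix in which every node has positive degree, that per-graph transition probabilities are row-normalised similarities, and that per-graph stationary probabilities are normalised degrees; `combine_graphs` is a hypothetical name and the uniform prior follows the choice stated above.

```python
import numpy as np

def combine_graphs(weights, priors=None):
    """Aggregate a random walk over several graphs on the same nodes.

    weights: list of (n, n) similarity matrices, one per graph.
    priors: prior probability of each graph (uniform if None).
    Returns the combined transition matrix P and stationary vector pi.
    """
    weights = [np.asarray(w, dtype=float) for w in weights]
    n_graphs = len(weights)
    if priors is None:
        priors = np.full(n_graphs, 1.0 / n_graphs)
    priors = np.asarray(priors, dtype=float)
    degrees = np.stack([w.sum(axis=1) for w in weights])             # (G, n)
    trans = np.stack([w / w.sum(axis=1, keepdims=True) for w in weights])
    stationary = degrees / degrees.sum(axis=1, keepdims=True)        # per graph
    joint = stationary * priors[:, None]
    pi = joint.sum(axis=0)                   # stationary probability pi(k)
    posterior = joint / pi                   # p(graph | k)
    P = np.einsum('gk,gkl->kl', posterior, trans)   # p(k -> l)
    return P, pi

rng = np.random.default_rng(2)
graphs = []
for _ in range(3):
    w = rng.random((5, 5))
    graphs.append((w + w.T) / 2)             # symmetric similarity matrices
P, pi = combine_graphs(graphs)
```

For symmetric similarities, `pi` is indeed stationary for `P`, i.e. `pi @ P` equals `pi` up to rounding, which is a useful sanity check on any implementation.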

Given the defined graphs and random walk process, we can derive our label propagation algorithm (TMV-HLP). Let $P$ denote the transition probability matrix defined by Eq (9) and $\Pi$ the diagonal matrix with the elements $\pi(k)$ computed by Eqs (15)-(16). The Laplacian matrix $\mathcal{L}$ combines information of different views and is defined as: $\mathcal{L} = \Pi - \frac{\Pi P + P^T \Pi}{2}$. The label matrix $Z$ for labelled N-shot data or zero-shot prototypes is defined as:

$$Z(q_k, c) = \begin{cases} 1 & q_k \in \text{class } c \\ -1 & q_k \notin \text{class } c \\ 0 & \text{unknown} \end{cases} \qquad (17)$$

Given the label matrix $Z$ and Laplacian $\mathcal{L}$, label propagation on multiple graphs has the closed-form solution [56]: $\hat{Z} = \eta (\eta \Pi + \mathcal{L})^{-1} \Pi Z$, where $\eta$ is a regularisation parameter (footnote 8). Note that in our framework, both labelled target class instances and prototypes are modelled as graph nodes. Thus the difference between zero-shot and N-shot learning lies only in the initial labelled instances: zero-shot learning has the prototypes as labelled nodes; N-shot learning has instances as labelled nodes; and a new condition exploiting both prototypes and N-shot instances together is possible. This unified recognition framework thus applies when either or both of prototypes and labelled
instances are available. The computational cost of our TMV-HLP is $O((c_T + n_T)^2 \cdot n_V^2 + (c_T + n_T)^3)$, where $K$ is the number of nearest neighbours in the KNN graphs, and $n_V$ is the number of views. It costs $O((c_T + n_T)^2 \cdot n_V^2)$ to construct the heterogeneous graph, while inverting the Laplacian matrix $\mathcal{L}$ in the label propagation step takes $O((c_T + n_T)^3)$ computational time, which can however be further reduced to $O(c_T n_T t)$ using the recent work of Fujiwara et al. [16], where $t$ is an iteration parameter in their paper and $t \ll n_T$.
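
The closed-form label propagation can be sketched as follows, assuming the Laplacian $\mathcal{L} = \Pi - (\Pi P + P^T \Pi)/2$ and the solution $\hat{Z} = \eta(\eta\Pi + \mathcal{L})^{-1}\Pi Z$. The function name `propagate_labels` and the toy four-node graph are illustrative only; a dense linear solve is used here, which has the cubic cost mentioned above.

```python
import numpy as np

def propagate_labels(P, pi, Z, eta=1.0):
    """Closed-form label propagation on a combined random walk.

    P: (n, n) transition matrix; pi: (n,) stationary probabilities;
    Z: (n, c) label matrix with entries +1, -1 or 0 (unknown).
    """
    Pi = np.diag(pi)
    laplacian = Pi - (Pi @ P + P.T @ Pi) / 2.0
    # Solve (eta * Pi + L) Z_hat = eta * Pi Z rather than forming the inverse.
    return np.linalg.solve(eta * Pi + laplacian, eta * (Pi @ Z))

# Toy graph: nodes {0, 1} and {2, 3} form two tightly connected groups.
W = np.array([[0.0, 1.0, 0.1, 0.1],
              [1.0, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 1.0],
              [0.1, 0.1, 1.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()
# Node 0 is labelled as class 0 and node 3 as class 1; nodes 1, 2 are unknown.
Z = np.array([[1, -1], [0, 0], [0, 0], [-1, 1]], dtype=float)
Z_hat = propagate_labels(P, pi, Z)
predicted = Z_hat.argmax(axis=1)
```

In the zero-shot case the labelled rows of `Z` would correspond to prototype nodes, and in the N-shot case to labelled instances; the propagation step itself is unchanged.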

7. That is, the K-nearest-neighbour graph of each view in $\Gamma$ [14].

8. It can be varied from 1 to 10 with little effect in our experiments.

5 Annotation and Beyond

Our multi-view embedding space $\Gamma$ bridges the semantic gap between low-level features $\mathcal{X}$ and semantic representations $\mathcal{A}$ and $\mathcal{V}$. Leveraging this cross-view mapping, annotation [20], [?], [?] can be improved and applied in novel ways. We consider three annotation tasks here:
Instance level annotation. Given a new instance $u$, we can describe/annotate it by predicting its attributes. The conventional solution is to directly apply $\hat{y}_u^{\mathcal{A}} = f^{\mathcal{A}}(x_u)$ to test data [?], [?]. However, as analysed before, this suffers from the projection domain shift problem. To alleviate this, our multi-view embedding space aligns the semantic attribute projections with the low-level features of each unlabelled instance in the target domain. This alignment can be used for image annotation in the target domain. Thus, with our framework, we can now infer attributes for any test instance via the learned embedding space $\Gamma$ as $\hat{y}_u^{\mathcal{A}} = x_u W^{\mathcal{X}} \tilde{D}^{\mathcal{X}} \left[ W^{\mathcal{A}} \tilde{D}^{\mathcal{A}} \right]^{-1}$.


Zero-shot class description. From a broader machine intelligence perspective, one might be interested to ask what the attributes of an unseen class are, based solely on the class name. Given our multi-view embedding space, we can infer the semantic attribute description of a novel class. This zero-shot class description task could be useful, for example, to hypothesise the zero-shot attribute prototype of a class instead of having it defined by experts [27] or an ontology [15]. Our transductive embedding enables this task by connecting the semantic word space (i.e. naming) and the discriminative attribute space (i.e. describing). Given the prototype $y_c^{\mathcal{V}}$ from the name of a novel class $c$, we compute $\hat{y}_c^{\mathcal{A}} = y_c^{\mathcal{V}} W^{\mathcal{V}} \tilde{D}^{\mathcal{V}} \left[ W^{\mathcal{A}} \tilde{D}^{\mathcal{A}} \right]^{-1}$ to generate the class-level attribute description.

Zero prototype learning. This task is the inverse of the previous task: to infer the name of a class given a set of attributes. It could be useful, for example, to validate or assess a proposed zero-shot attribute prototype, or to provide an automated semantic-property based index into a dictionary or database. To our knowledge, this is the first attempt to evaluate the quality of a class attribute prototype, because no previous work has directly and systematically linked the linguistic knowledge space with the visual attribute space. Specifically, given an attribute prototype $\hat{y}_c^{\mathcal{A}}$, we can use $\hat{y}_c^{\mathcal{V}} = \hat{y}_c^{\mathcal{A}} W^{\mathcal{A}} \tilde{D}^{\mathcal{A}} \left[ W^{\mathcal{V}} \tilde{D}^{\mathcal{V}} \right]^{-1}$ to name the corresponding class and perform retrieval on dictionary words in $\mathcal{V}$ using $\hat{y}_c^{\mathcal{V}}$.


6 Experiments 
6.1 Datasets and settings 

We evaluate our framework on three widely used image/video datasets: Animals with Attributes (AwA), Unstructured Social Activity Attribute (USAA), and Caltech-UCSD-Birds (CUB). AwA [27] consists of 50 classes of animals (30,475 images) and 85 associated class-level attributes. It has a standard source/target split for zero-shot learning with 10 classes and 6,180 images held out as the target dataset. We use the same 'hand-crafted' low-level features (RGB colour histograms, SIFT, rgSIFT, PHOG, SURF and local self-similarity histograms) released with the dataset (denoted as $\mathcal{H}$), and the same multi-kernel learning (MKL) attribute classifier from [?]. USAA is a video dataset [15] with 69 instance-level attributes for 8 classes of complex (unstructured) social group activity videos from YouTube. Each class has around 100 training and 100 test videos. USAA provides instance-level attributes since there are significant intra-class variations. We use the thresholded mean of the instances from each class to define a binary attribute prototype as in [15]. The same setting as in [15] is adopted: 4 classes as source and 4 classes as target data. We use exactly the same SIFT, MFCC and STIP low-level features for USAA as in [15]. CUB-200-2011 [43] contains 11,788 images of 200 bird classes. This is more challenging than AwA - it is designed for fine-grained recognition and has more classes but fewer images. Each class is annotated with 312 binary attributes derived from a bird species ontology. We use 150 classes as auxiliary data, holding out 50 as test data. We extract 128-dimensional SIFT and colour histogram descriptors from a regular multi-scale grid and aggregate them into image-level Fisher Vector features ($\mathcal{F}$) using 256 Gaussians, as in [1]. Colour histogram and PHOG features are also used to extract global colour and texture cues from each image. Due to the recent progress on deep learning based representations, we also extract OverFeat ($\mathcal{O}$) [12] (footnote 9) from AwA and CUB as an alternative to $\mathcal{H}$ and $\mathcal{F}$ respectively. In addition, DeCAF ($\mathcal{D}$) [7] is also considered for AwA.

We report absolute classification accuracy on USAA and mean accuracy for AwA and CUB for direct comparison to published results. The word vector space is trained by the model in [33] with 1,000 dimensions.

6.2 Recognition by zero-shot learning 

6.2.1 Comparisons with state-of-the-art 

We compare our method (TMV-HLP) with the recent state-of-the-art models that report results or can be re-implemented by us on the three datasets in Table 1. They cover a wide range of approaches to utilising semantic intermediate representations for zero-shot learning. They can be roughly categorised according to the semantic representation(s) used: DAP and IAP ([27], [28]), M2LATM [?], ALE [1], [?] and [50] use attributes only; HLE/AHLE [1] and Mo/Ma/O/D [38] use both attributes and linguistic knowledge bases (same as us); [54] uses attributes and some additional human manual annotation. Note that our linguistic knowledge base representation is in the form of word vectors, which does not incur additional manual annotation. Our method also does not exploit data-driven attributes, unlike M2LATM [?] and Mo/Ma/O/D [38].

Consider first the results on the most widely used AwA. Apart from the standard hand-crafted feature ($\mathcal{H}$), we consider the more powerful OverFeat deep feature ($\mathcal{O}$), and a combination of OverFeat and DeCAF ($\mathcal{O}, \mathcal{D}$) (footnote 10). Table 1 shows that (1) with the same experimental settings and the same feature ($\mathcal{H}$), our TMV-HLP (49.0%) outperforms the best result reported so far (48.3%) in [54] which, unlike ours, requires additional human annotation to relabel the similarities between auxiliary and target classes. (2) With the more powerful OverFeat feature, our method achieves 73.5% zero-shot recognition accuracy. Even more remarkably, when both the OverFeat and DeCAF features are used in our framework, the result (see the AwA ($\mathcal{O}, \mathcal{D}$) column) is 80.5%. Even with only 10 target classes, this is an extremely good result given that we do not have any labelled samples from the target classes. Note that this good result is not solely due to the feature strength, as the margin between the conventional DAP and our TMV-HLP is much bigger with these features (57.1% vs. 80.5%) than with the hand-crafted ones, indicating that our TMV-HLP plays a critical role in achieving this result. (3) Our method is also superior to the AHLE method in [1], which also uses two semantic spaces: attribute and WordNet hierarchy. Different from our embedding framework, AHLE simply concatenates the two spaces. (4) Our method also outperforms the other alternatives of either mining other semantic knowledge bases (Mo/Ma/O/D [38]) or exploring data-driven attributes (M2LATM [?]). (5) Among all compared methods, PST [36] is the only one except ours that performs label propagation based transductive learning. In all the experiments it yields better results than DAP, which essentially does nearest neighbour in the semantic space. TMV-HLP consistently beats PST in all the results shown in Table 1 thanks to our multi-view embedding. (6) Compared to our TMV-BLP model [14], the superior results of TMV-HLP show that the proposed heterogeneous hypergraph is more effective than the homogeneous 2-graphs used in TMV-BLP for zero-shot learning.

9. We use the trained model of OverFeat in [?].

Approach          | AwA (H [27])                      | AwA (O) | AwA (O, D) | USAA                | CUB (O) | CUB (F)
------------------+-----------------------------------+---------+------------+---------------------+---------+--------
DAP               | 40.5 ([27]) / 41.4 ([28]) / 38.4* | 51.0*   | 57.1*      | 33.2 ([15]) / 35.2* | 26.2*   | 9.1*
IAP               | 27.8 ([27]) / 42.2 ([28])         | -       | -          | -                   | -       | -
M2LATM [?] ***    | 41.3                              | -       | -          | 41.9                | -       | -
ALE/HLE/AHLE [1]  | 37.4 / 39.0 / 43.5                | -       | -          | -                   | -       | 18.0*
Mo/Ma/O/D [38]    | 27.0 / 23.6 / 33.0 / 35.7         | -       | -          | -                   | -       | -
PST [36] ***      | 42.7                              | 54.1*   | 62.9*      | 36.2*               | 38.3*   | 13.2*
[54]              | 48.3**                            | -       | -          | -                   | -       | -
TMV-BLP [14] ***  | 47.7                              | 69.9    | 77.8       | 48.2                | 45.2    | 16.3
TMV-HLP           | 49.0                              | 73.5    | 80.5       | 50.4                | 47.9    | 19.5

Table 1. Comparison with the state-of-the-art on zero-shot learning on AwA, USAA and CUB. Features H, O, D and F represent hand-crafted, OverFeat, DeCAF, and Fisher Vector respectively. Mo, Ma, O and D represent the highest results by the mined object class-attribute associations, mined attributes, objectness as attributes and direct similarity methods used in [38] respectively. '-': no result reported. *: our implementation. **: requires additional human annotations. ***: requires unlabelled data, i.e. a transductive setting.

Table 1 also shows that on two very different datasets, USAA (video activity) and CUB (fine-grained), our TMV-HLP significantly outperforms the state-of-the-art alternatives. In particular, on the more challenging CUB, 47.9% accuracy is achieved on 50 classes (chance level 2%) using the OverFeat feature. Considering the fine-grained nature and the number of classes, this is even more impressive than the 80.5% result on AwA.

10. With these two low-level feature views, there are six views in total in the embedding space.

[Figure 4 panels: (a) Soft vs. hard dimension weighting for CCA; (b) Contributions of CCA and LP.]

Figure 4. (a) Comparing soft and hard dimension weighting of CCA for AwA. (b) Contributions of CCA and label propagation on AwA. $\Psi^{\mathcal{A}}$ and $\Psi^{\mathcal{V}}$ indicate the subspaces of target data from view $\mathcal{A}$ and $\mathcal{V}$ in $\Gamma$ respectively. Hand-crafted features are used in both experiments.

6.2.2 Further evaluations 

Effectiveness of soft weighting for CCA embedding. In this experiment, we compare the soft-weighting (Eq [?]) of the CCA embedding space $\Gamma$ (a strategy adopted in this work) with the conventional hard-weighting strategy of selecting the number of dimensions for CCA projection. Fig. 4(a) shows that the performance of the hard-weighting CCA depends on the number of projection dimensions selected (blue curve). In contrast, our soft-weighting strategy uses all dimensions weighted by the CCA eigenvalues, so that the important dimensions are automatically weighted more highly. The result shows that this strategy is clearly better and that it is not very sensitive to the weighting parameter $\lambda$, with choices of $\lambda > 2$ all working well.

Contributions of individual components. There are two major components in our ZSL framework: CCA embedding and label propagation. In this experiment we investigate whether both of them contribute to the strong performance. To this end, we compare the ZSL results on AwA with label propagation and without (nearest neighbour), before and after CCA embedding. In Fig. 4(b), we can see that: (i) Label propagation always helps regardless of whether the views have been embedded using CCA, although its effects are more pronounced after embedding. (ii) Even without label propagation, i.e. using nearest neighbour for classification, the performance is improved by the CCA embedding. However, the improvement is bigger with label propagation. This result thus suggests that both CCA embedding and label propagation are useful, and that our ZSL framework works best when both are used.

Transductive multi-view embedding. To further validate the contribution of our transductive multi-view embedding space, we split up different views with and without embedding, and the results are shown in Fig. 5. In Figs. 5(a) and (c), the hand-crafted feature $\mathcal{H}$ and the SIFT, MFCC and STIP low-level features are used for AwA and USAA respectively, and we compare $\mathcal{V}$ vs. $\Gamma(\mathcal{X} + \mathcal{V})$, $\mathcal{A}$ vs. $\Gamma(\mathcal{X} + \mathcal{A})$ and $[\mathcal{V}, \mathcal{A}]$ vs. $\Gamma(\mathcal{X} + \mathcal{V} + \mathcal{A})$ (see the caption of Fig. 5 for definitions). We use DAP for $\mathcal{A}$ and nearest neighbour for $\mathcal{V}$ and $[\mathcal{V}, \mathcal{A}]$, because the prototypes of $\mathcal{V}$ are not binary vectors so DAP cannot be applied. We use TMV-HLP for $\Gamma(\mathcal{X} + \mathcal{V})$ and $\Gamma(\mathcal{X} + \mathcal{A})$ respectively. We highlight the following observations: (1) After transductive embedding, $\Gamma(\mathcal{X} + \mathcal{V} + \mathcal{A})$, $\Gamma(\mathcal{X} + \mathcal{V})$ and $\Gamma(\mathcal{X} + \mathcal{A})$ outperform $[\mathcal{V}, \mathcal{A}]$, $\mathcal{V}$ and $\mathcal{A}$ respectively. This means that the transductive embedding is helpful, whichever semantic space is used, in rectifying the projection domain shift problem by aligning the semantic views with low-level features. (2) The results of $[\mathcal{V}, \mathcal{A}]$ are higher than those of $\mathcal{A}$ and $\mathcal{V}$ individually, showing that the two semantic views are indeed complementary even with simple feature level fusion. Similarly, our TMV-HLP on all views $\Gamma(\mathcal{X} + \mathcal{V} + \mathcal{A})$ improves on the individual embeddings $\Gamma(\mathcal{X} + \mathcal{V})$ and $\Gamma(\mathcal{X} + \mathcal{A})$.

Embedding deep learning feature views also helps. In Fig. 5(b) three different low-level features are considered for AwA: hand-crafted ($\mathcal{H}$), OverFeat ($\mathcal{O}$) and DeCAF ($\mathcal{D}$) features. The zero-shot learning results of each individual space are indicated as $\mathcal{V}_{\mathcal{H}}$, $\mathcal{A}_{\mathcal{H}}$, $\mathcal{V}_{\mathcal{O}}$, $\mathcal{A}_{\mathcal{O}}$, $\mathcal{V}_{\mathcal{D}}$, $\mathcal{A}_{\mathcal{D}}$ in Fig. 5(b), and we observe that $\mathcal{V}_{\mathcal{O}} > \mathcal{V}_{\mathcal{D}} > \mathcal{V}_{\mathcal{H}}$ and $\mathcal{A}_{\mathcal{O}} > \mathcal{A}_{\mathcal{D}} > \mathcal{A}_{\mathcal{H}}$. That is, OverFeat > DeCAF > hand-crafted features. It is widely reported that deep features have better performance than 'hand-crafted' features on many computer vision benchmark datasets [?], [?]. What is interesting to see here is that OverFeat > DeCAF, since both are based on the same Convolutional Neural Network (CNN) model of [?]. Apart from implementation details, one significant difference is that DeCAF is pre-trained on ILSVRC2012 while OverFeat is pre-trained on ILSVRC2013, which contains more animal classes, meaning that better (more relevant) features can be learned. It is also worth pointing out that: (1) With both OverFeat and DeCAF features, the number of views used to learn an embedding space increases from 3 to 9; our results suggest that the more views there are, the better the chance of solving the domain shift problem, and the data become more separable as different views contain complementary information. (2) Figure 5(b) shows that when all 9 available views ($\mathcal{X}_{\mathcal{H}}$, $\mathcal{V}_{\mathcal{H}}$, $\mathcal{A}_{\mathcal{H}}$, $\mathcal{X}_{\mathcal{D}}$, $\mathcal{V}_{\mathcal{D}}$, $\mathcal{A}_{\mathcal{D}}$, $\mathcal{X}_{\mathcal{O}}$, $\mathcal{V}_{\mathcal{O}}$ and $\mathcal{A}_{\mathcal{O}}$) are used for embedding, the result is significantly better than those from each individual view. Nevertheless, it is lower than that obtained by embedding the six views ($\mathcal{X}_{\mathcal{D}}$, $\mathcal{V}_{\mathcal{D}}$, $\mathcal{A}_{\mathcal{D}}$, $\mathcal{X}_{\mathcal{O}}$, $\mathcal{V}_{\mathcal{O}}$ and $\mathcal{A}_{\mathcal{O}}$). This suggests that view selection may be required when a large number of views are available for learning the embedding space.

Embedding makes target classes more separable. We employ t-SNE [13] to visualise the spaces $\mathcal{X}_{\mathcal{O}}$, $\mathcal{V}_{\mathcal{O}}$, $\mathcal{A}_{\mathcal{O}}$ and $\Gamma(\mathcal{X} + \mathcal{V} + \mathcal{A})_{\mathcal{O}, \mathcal{D}}$ in Fig. 6. It shows that even in the powerful OverFeat view, the 10 target classes are heavily overlapped (Fig. 6(a)). It gets better in the semantic views (Figs. 6(b) and (c)). However, when all 6 views are embedded, all classes are clearly separable (Fig. 6(d)).

Running time. In practice, for the AwA dataset with hand-crafted features, our pipeline takes less than 30 minutes to complete the zero-shot classification task (over 6,180 images) using a six-core 2.66GHz CPU platform. This includes the time for multi-view CCA embedding and label propagation using our heterogeneous hypergraphs.

6.3 Annotation and beyond 

In this section we evaluate our multi-view embedding space for the conventional and novel annotation tasks introduced in Sec. 5.

Instance annotation by attributes. To quantify the annotation performance, we predict attributes/annotations for each target class instance for USAA, which has the largest instance-level attribute variations among the three datasets. We employ two standard measures: mean average precision (mAP) and F-measure (FM) between the estimated and true annotation lists. Using our multi-view embedding space, our method (FM: 0.341, mAP: 0.355) significantly outperforms the baseline of directly estimating $\hat{y}_u^{\mathcal{A}} = f^{\mathcal{A}}(x_u)$ (FM: 0.299, mAP: 0.267).
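
For reference, one common set-based definition of the F-measure between an estimated and a true annotation list can be sketched as follows. This is an assumed formulation for a single instance, not necessarily the exact protocol used in the experiments, and the helper name `f_measure` is hypothetical.

```python
def f_measure(predicted, truth):
    """Harmonic mean of precision and recall between two annotation lists."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

# One of the two predicted attributes is correct, and one of two true ones is found.
score = f_measure(["wrapped presents", "small balloon"],
                  ["small balloon", "birthday song"])
```

A dataset-level score would then be obtained by averaging this value over all test instances.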
Zero-shot description. In this task, we explicitly infer the attributes corresponding to a specified novel class, given only the textual name of that class without seeing any visual samples. Table 2 illustrates this for AwA. Clearly most of the top/bottom 5 attributes predicted for each of the 10 target classes are meaningful (in the ideal case, all top 5 should be true positives and all bottom 5 true negatives). Predicting the top-5 attributes for each class gives an F-measure of 0.236. In comparison, if we directly select the 5 attribute name projections nearest to the class name projection (prototype) in the word space, the F-measure is 0.063, demonstrating the importance of learning the multi-view embedding space. In addition to providing a method to automatically - rather than manually - generate an attribute ontology, this task is interesting because even a human could find it very challenging (effectively a human has to list the attributes of a class which he has never seen or been explicitly taught about, but has only seen mentioned in text).

[Figure 5 panels: (a) ZSL of AwA (hand-crafted features); (b) ZSL of AwA (hand-crafted and deep features); (c) Zero-shot learning of USAA.]

Figure 5. Effectiveness of transductive multi-view embedding. (a) Zero-shot learning on AwA using only hand-crafted features; (b) zero-shot learning on AwA using hand-crafted and deep features together; (c) zero-shot learning on USAA. $[\mathcal{V}, \mathcal{A}]$ indicates the concatenation of semantic word and attribute space vectors. $\Gamma(\mathcal{X} + \mathcal{V})$ and $\Gamma(\mathcal{X} + \mathcal{A})$ mean using low-level + semantic word spaces and low-level + attribute spaces respectively to learn the embedding. $\Gamma(\mathcal{X} + \mathcal{V} + \mathcal{A})$ indicates using all 3 views to learn the embedding.

[Figure 6 legend: Persian cat, hippopotamus, leopard, humpback whale, seal, chimpanzee, rat, giant panda, pig, raccoon.]

Figure 6. t-SNE visualisation of (a) the OverFeat view ($\mathcal{X}_{\mathcal{O}}$), (b) the attribute view ($\mathcal{A}_{\mathcal{O}}$), (c) the word vector view ($\mathcal{V}_{\mathcal{O}}$), and (d) the transition probability of pairwise nodes computed by Eq (9) of TMV-HLP in $\Gamma(\mathcal{X} + \mathcal{V} + \mathcal{A})_{\mathcal{O}, \mathcal{D}}$. The unlabelled target classes are much more separable in (d).

Zero prototype learning. In this task we attempt the reverse of the previous experiment: inferring a class name given a list of attributes. Table 3 illustrates this for USAA. Table 3(a) shows queries by the groundtruth attribute definitions of some USAA classes and the top-4 ranked list of classes returned. The estimated class names of each attribute vector are reasonable - the top-4 words are either the class name or related to the class name. A baseline is to use the textual names of the attributes projected in the word space (summing their word vectors) to search for the nearest classes in word space, instead of the embedding space. Table 3(a) shows that the predicted classes in this case are reasonable, but significantly worse than querying via the embedding space. To quantify this we evaluate the average rank of the true name for each USAA class when queried by its attributes. For querying by embedding space, the average rank is an impressive 2.13 (out of 4.33M words with a chance-level rank of 2.17M), compared with the average rank of 110.24 by directly querying word space [32] with textual descriptions of the attributes. Table 3(b) shows an example of an 'incremental' query using the ontology definition of birthday party [15]. We first query the 'wrapped presents' attribute only, followed by adding 'small balloon' and all other attributes ('birthday songs' and 'birthday caps'). The changing list of top ranked retrieved words intuitively reflects the expectation of the combinatorial meaning of the attributes.

AwA class | Rank | Attributes
----------+------+--------------------------------------------------
pc        | T-5  | active, furry, tail, paws, ground
          | B-5  | swims, hooves, long neck, horns, arctic
hp        | T-5  | old world, strong, quadrupedal, fast, walks
          | B-5  | red, plankton, skimmers, stripes, tunnels
lp        | T-5  | old world, active, fast, quadrupedal, muscle
          | B-5  | plankton, arctic, insects, hops, tunnels
hw        | T-5  | fish, smart, fast, group, flippers
          | B-5  | hops, grazer, tunnels, fields, plains
seal      | T-5  | old world, smart, fast, chew teeth, strong
          | B-5  | fly, insects, tree, hops, tunnels
cp        | T-5  | fast, smart, chew teeth, active, brown
          | B-5  | tunnels, hops, skimmers, fields, long neck
rat       | T-5  | active, fast, furry, new world, paws
          | B-5  | arctic, plankton, hooves, horns, long neck
gp        | T-5  | quadrupedal, active, old world, walks, furry
          | B-5  | tunnels, skimmers, long neck, blue, hops
pig       | T-5  | quadrupedal, old world, ground, furry, chew teeth
          | B-5  | desert, long neck, orange, blue, skimmers
rc        | T-5  | fast, active, furry, quadrupedal, forest
          | B-5  | long neck, desert, tusks, skimmers, blue

Table 2. Zero-shot description of 10 AwA target classes. $\Gamma$ is learned using 6 views ($\mathcal{X}_{\mathcal{D}}$, $\mathcal{V}_{\mathcal{D}}$, $\mathcal{A}_{\mathcal{D}}$, $\mathcal{X}_{\mathcal{O}}$, $\mathcal{V}_{\mathcal{O}}$ and $\mathcal{A}_{\mathcal{O}}$). The true positives are highlighted in bold. pc, hp, lp, hw, cp, gp, and rc are short for Persian cat, hippopotamus, leopard, humpback whale, chimpanzee, giant panda, and raccoon respectively. T-5/B-5 are the top/bottom 5 attributes predicted for each target class.

(a) Query by GT attributes of | Query via embedding space                | Query attribute words in word space
------------------------------+------------------------------------------+------------------------------------------------
graduation party              | party, graduation, audience, caucus      | cheering, proudly, dressed, wearing
music performance             | music, performance, musical, heavy metal | sing, singer, sang, dancing
wedding ceremony              | wedding ceremony, wedding, glosses, stag | nun, christening, bridegroom, wedding ceremony

(b) Attribute query           | Top ranked words
------------------------------+-----------------------------------------------------------
wrapped presents              | music; performance; solo performances; performing
+small balloon                | wedding; wedding reception; birthday celebration; birthday
+birthday song +birthday caps | birthday party; prom; wedding reception

Table 3. Zero prototype learning on USAA. (a) Querying classes by groundtruth (GT) attribute definitions of the specified classes. (b) An incrementally constructed attribute query for the birthday_party class. Bold indicates true positive.

7 Conclusions 

We identified the challenge of projection domain shift in
zero-shot learning and presented a new framework to
solve it by rectifying the biased projections in a multi-view
embedding space. We also proposed a novel label propagation
algorithm, TMV-HLP, based on heterogeneous
across-view hypergraphs. TMV-HLP synergistically
exploits multiple intermediate semantic representations,
as well as the manifold structure of unlabelled
target data, to improve recognition in a unified way for
zero-shot, N-shot and zero+N-shot learning tasks. As a
result we achieved state-of-the-art performance on the
challenging AwA, CUB and USAA datasets. Finally, we
demonstrated that our framework enables novel tasks of
relating textual class names and their semantic attributes.


A number of directions have been identified for future
work. First, we employ CCA for learning the embedding
space. Although it works well, other embedding
frameworks can be considered (e.g. [49]). In the current
pipeline, low-level features are first projected onto
different semantic views before embedding. It should
be possible to develop a unified embedding framework
to combine these two steps. Second, under a realistic
lifelong learning setting [?], an unlabelled data point
could belong to either a seen/auxiliary category or an
unseen class. An ideal framework should be able to
classify both seen and unseen classes [44]. Finally, our
results suggest that more views, whether manually defined
(attributes), extracted from a linguistic corpus (word
space), or learned from visual data (deep features), can
potentially give rise to a better embedding space. More
investigation is needed on how to systematically design
and select semantic views for embedding.

8.1 Heterogeneous hypergraph vs. other graphs

Apart from transductive multi-view embedding, another
major contribution of this paper is a novel label propagation
method based on heterogeneous hypergraphs. To
evaluate the effectiveness of our hypergraph label propagation,
we compare it with a number of alternative label
propagation methods using other graph models. More
specifically, within each view, two alternative graphs
can be constructed: 2-graphs, which are used in the
classification on multiple graphs (C-MG) model [56], and
conventional homogeneous hypergraphs formed in each
single view [?], [30]. Since hypergraphs are typically
combined with 2-graphs, we have 5 different multi-view
graph models: 2-gr (2-graph in each view), Homo-hyper
(homogeneous hypergraph in each view), Hete-hyper (our
heterogeneous hypergraph across views), Homo-hyper+2-gr
(homogeneous hypergraph combined with 2-graph
in each view), and Hete-hyper+2-gr (our heterogeneous
hypergraph combined with 2-graph, as in our TMV-HLP).
In our experiments, the same random walk label
propagation algorithm is run on each graph in AwA
and USAA before and after transductive embedding to
compare these models.
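
To make the comparison protocol concrete, the sketch below runs a generic random-walk-with-restart label propagation on a conventional k-nearest-neighbour 2-graph. It is an illustration under simplifying assumptions (Gaussian affinities, a closed-form solution, the function names `knn_graph` and `propagate`, and toy 2-D data are all hypothetical), not the TMV-HLP implementation:

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric k-nearest-neighbour affinity matrix (a conventional 2-graph)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    sigma2 = np.median(d2[d2 > 0])                         # heuristic bandwidth
    W = np.exp(-d2 / sigma2)
    np.fill_diagonal(W, 0.0)                               # no self-loops
    nn = np.argsort(-W, axis=1)[:, :k]                     # k strongest neighbours
    keep = np.zeros_like(W, dtype=bool)
    keep[np.arange(len(X))[:, None], nn] = True
    return W * (keep | keep.T)                             # keep an edge if either end selects it

def propagate(W, Y, alpha=0.9):
    """Closed-form random walk with restart: F = (1 - alpha) (I - alpha P)^-1 Y."""
    P = W / W.sum(axis=1, keepdims=True)                   # row-stochastic transitions
    n = len(W)
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * P, Y)

# Toy data: two well-separated clusters, one labelled node per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
Y = np.zeros((20, 2))
Y[0, 0] = 1.0
Y[10, 1] = 1.0
pred = propagate(knn_graph(X), Y).argmax(axis=1)
```

Running the same `propagate` routine on different affinity matrices is the sense in which the graph models above are compared with the propagation algorithm held fixed.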

From the results in Fig. 7 we observe that: (1) The
graph model used in our TMV-HLP (Hete-hyper+2-gr)
yields the best performance on both datasets. (2) All
graph models benefit from the embedding. In particular,
the performance of our heterogeneous hypergraph degrades
drastically without embedding. This is expected
because before embedding, nodes in different views are
not aligned, so forming meaningful hyperedges across
views is not possible. (3) Fusing hypergraphs with 2-graphs
helps: hypergraphs contribute robustness and 2-graphs
contribute discriminative power, so it makes sense to combine
the strengths of both. (4) After embedding, heterogeneous
hypergraphs on their own are the best, while homogeneous
hypergraphs (Homo-hyper) are worse than 2-gr, indicating
that the discriminative power of 2-graphs outweighs
the robustness of homogeneous hypergraphs.
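
Observation (2) can be illustrated with a simplified hyperedge construction in which each node, taken as a query in one view, is grouped with its k most similar nodes in another view. This is only a sketch under that assumption: cosine similarity, the function name `hyperedges` and the random toy views are hypothetical. It presumes both views already live in a common space, which is what the embedding provides:

```python
import numpy as np

def hyperedges(query, target, k=3):
    """Incidence matrix H (nodes x hyperedges).

    Hyperedge c contains the k nodes whose `target`-view vectors are most
    cosine-similar to node c's `query`-view vector.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    sim = q @ t.T                                   # sim[c, j]: query c vs node j
    nn = np.argsort(-sim, axis=1)[:, :k]            # k most similar nodes per query
    H = np.zeros((len(target), len(query)))
    H[nn, np.arange(len(query))[:, None]] = 1.0     # node nn[c, m] joins hyperedge c
    return H

rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 4))                    # nodes in one embedded view
view_b = view_a + 0.05 * rng.normal(size=(8, 4))    # an aligned second view
H_homo = hyperedges(view_a, view_a)                 # within-view (homogeneous)
H_hete = hyperedges(view_a, view_b)                 # across-view (heterogeneous)
```

If `view_b` were not aligned with `view_a` (for example, expressed in a different space), the cross-view similarities would be meaningless, which is consistent with the drop in performance reported without embedding.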


Supplementary Material 

8 Further Evaluations on Zero-Shot Learning


[Figures 7 and 8: plots titled "Comparing Alternative-LP on AwA" and "Comparing Alternative-LP on USAA"; axis label "Different Spaces"]

Figure 7. Comparing alternative label propagation methods
using different graphs before and after transductive
embedding (T-embed). The methods are detailed in text.

Figure 8. Comparing our method with the alternative
C-MG and PST methods before and after transductive
embedding.

8.2 Comparing with other transductive methods

Rohrbach et al. [36] employed an alternative transductive
method, termed PST, for zero-shot learning. We
compare with PST for zero-shot learning here and for N-shot
learning in Sec. 9.

Specifically, we use the AwA dataset with hand-crafted
features, semantic word vector V, and semantic attribute
A. We compare with PST [36] as well as the graph-based
semi-supervised learning method C-MG [56] (not
originally designed for zero-shot learning, but applicable
to it) before and after our transductive embedding (T-embed).
We use equal weights for each graph for C-MG
and the same parameters as in [36] for PST. From
Fig. 8 we draw the following conclusions: (1) TMV-HLP
in our embedding space outperforms both alternatives.
(2) The embedding also improves both C-MG and PST,
because aligning the semantic projections and low-level
features alleviates the projection domain shift.

The reasons for TMV-HLP outperforming PST include:
(1) using multiple semantic views (PST is defined on only
one); (2) using hypergraph-based label propagation
(PST uses only conventional 2-graphs); (3) less dependence
on the good-quality initial labelling heuristics required
by PST: our TMV-HLP uses trivial initial labels
(each prototype labelled according to its class, as in Eq.
(17) in the main manuscript).

8.3 How many target samples are needed for learning multi-view embedding?

In our paper, we use all the target class samples to
construct the transductive CCA embedding space. Here
we investigate how many samples are required to construct
a reasonable embedding. We use hand-crafted
features (dimension: 10,925) of the AwA dataset with
semantic word vector V (dimension: 1,000) and semantic
attribute A (dimension: 85) to construct the CCA space.
We randomly select 1%, 3%, 5%, 20%, 40%, 60%, and
80% of the unlabelled target class instances to construct
the CCA space for zero-shot learning using our TMV-HLP.
Random sampling is repeated 10 times. The results
shown in Fig. 9 demonstrate that only 5% of the
full set of samples (300 in the case of AwA) is sufficient
to learn a good embedding space.
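
The subsampling experiment can be mimicked on synthetic data. The sketch below fits a regularised two-view CCA (standing in for the multi-view CCA used in the paper) on a random 5% subset and measures the leading canonical correlation on all instances. The function `fit_cca`, the regulariser `reg`, and the synthetic views are assumptions made for illustration only:

```python
import numpy as np

def fit_cca(X, Y, dim, reg=1e-3):
    """Two-view CCA via SVD of the whitened cross-covariance.

    Returns projection matrices for X and Y and the canonical correlations.
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])     # regularised covariances
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Lx_inv = np.linalg.inv(np.linalg.cholesky(Cxx))    # whitening transform for X
    Ly_inv = np.linalg.inv(np.linalg.cholesky(Cyy))    # whitening transform for Y
    U, s, Vt = np.linalg.svd(Lx_inv @ Cxy @ Ly_inv.T)
    return Lx_inv.T @ U[:, :dim], Ly_inv.T @ Vt[:dim].T, s[:dim]

# Synthetic two-view data sharing a 3-D latent structure.
rng = np.random.default_rng(0)
n = 2000
Z = rng.normal(size=(n, 3))
X = Z @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(n, 20))
Y = Z @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(n, 10))

idx = rng.choice(n, size=int(0.05 * n), replace=False)    # 5% of the instances
Wx, Wy, _ = fit_cca(X[idx], Y[idx], dim=3)
corr = np.corrcoef((X @ Wx)[:, 0], (Y @ Wy)[:, 0])[0, 1]  # evaluated on all instances
```

Repeating the last three lines for several subset sizes and random seeds gives the kind of curve summarised in this section, although the toy data says nothing about the actual AwA numbers.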

8.4 Can the embedding space be learned using the auxiliary dataset?

Since we aim to rectify the projection domain shift 
problem for the target data, the multi-view embedding 
space is learned transductively using the target dataset. 
One may ask whether the multi-view embedding space 
learned using the auxiliary dataset can be of any use for 
the zero-shot classification of the target class samples. To 
answer this question, we conduct experiments using
the hand-crafted features (dimension: 10,925) of the AwA
dataset with semantic word vector V (dimension: 1,000)
and semantic attribute A (dimension: 85). Auxiliary data



Figure 9. Influence of varying the number of unlabelled 
target class samples used to learn the CCA space. 

from AwA are now used to learn the multi-view CCA,
and we then project the testing data into this CCA space.
We compare the results of our TMV-HLP on AwA using
the CCA spaces learned from the auxiliary dataset versus
the unlabelled target dataset in Fig. 10. It can be seen that
the CCA embedding space learned using the auxiliary
dataset gives reasonable performance; however, it does
not perform as well as our method, which learns the
multi-view embedding space transductively using target
class samples. This is likely because an embedding learned
from auxiliary data never observes the projection
domain shift and thus cannot learn to rectify it.


[Figure 10 plot: TMV-HLP on AwA (hand-crafted features); conditions: CCA with train images, CCA with test images]


Figure 10. Comparing ZSL performance using CCA 
learned from auxiliary data versus unlabelled target data. 

8.5 Qualitative results 

Figure 11 shows some qualitative results for zero-shot
learning on AwA in terms of the top 5 most likely classes
predicted for each image. It shows that our TMV-HLP
produces a more reasonable ranked list of classes for each
image than DAP and PST.

9 N-Shot learning 

N-shot learning experiments are carried out on the three
datasets with the number of labelled target class instances
(N) ranging from 0 (zero-shot) to 20. We also

[Figure 12 panel titles: N-shot Learning for AwA; N-shot Learning for USAA; N-shot Learning for CUB]

Figure 12. N-shot learning results with (+) and without (-) additional prototype information.



[Figure 11 content: top 5 predicted classes for three example images]

Image 1
    TMV-HLP: giant panda, leopard, seal, rat, raccoon
    PST:     giant panda, seal, raccoon, rat, leopard
    DAP:     leopard, giant panda, raccoon, seal, chimpanzee
Image 2
    TMV-HLP: raccoon, seal, Persian cat, leopard, chimpanzee
    PST:     leopard, humpback whale, raccoon, chimpanzee, Persian cat
    DAP:     raccoon, chimpanzee, leopard, humpback whale, seal
Image 3
    TMV-HLP: rat, Persian cat, chimpanzee, seal, raccoon
    PST:     chimpanzee, rat, Persian cat, raccoon, seal
    DAP:     leopard, giant panda, seal, hippopotamus, Persian cat


Figure 11. Qualitative results for zero-shot learning on
AwA. Bold indicates correct class names.

consider the situation [36] where both a few training
examples and a zero-shot prototype may be available
(denoted with suffix +), and contrast it to the conventional
N-shot learning setting where solely labelled data
and no prototypes are available (denoted with suffix
-). For comparison, PST+ is the method in [36] which
uses prototypes for the initial label matrix. SVM+ and
M2LATM- are the SVM and M2LATM methods used
in [28] and [15] respectively. For fair comparison, we
modify the SVM- used in [28] into SVM+ (i.e., add the
prototype to the pool of SVM training data). Note that
our TMV-HLP can be used in both conditions but the
PST method [36] only applies to the + condition. All
experiments are repeated for 10 rounds with the average
results reported. Evaluation is done on the remaining
unlabelled target data. From the results shown in Fig. 12
it can be seen that: (1) TMV-HLP+ always achieves
the best performance, particularly given few training
examples. (2) The methods that explore transductive
learning via label propagation (TMV-HLP+, TMV-HLP-,
and PST+) are clearly superior to those that do not
(SVM+ and M2LATM-). (3) On AwA, PST+ outperforms
TMV-HLP- with fewer than 3 instances per class. Because
PST+ exploits the prototypes, this suggests that a single
good prototype is more informative than a few labelled
instances in N-shot learning. This also explains why
sometimes the N-shot learning results of TMV-HLP+ are
worse than its zero-shot learning results when only a few
training labels are observed (e.g. on AwA, the TMV-HLP+
accuracy goes down before going up when more
labelled instances are added). Note that when more
labelled instances are available, TMV-HLP- starts to outperform
PST+, because it combines the different views
of the training instances, and the strong effect of the
prototypes is eventually outweighed.

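The N-shot evaluation protocol used in Sec. 9 (label N instances per target class, evaluate on the remaining instances, average over 10 rounds) can be sketched as follows. The nearest-class-mean classifier is a deliberately simple stand-in, not TMV-HLP, and the names `n_shot_split` and `nearest_mean_accuracy` and the toy data are hypothetical:

```python
import numpy as np

def n_shot_split(labels, n_shot, rng):
    """Pick n_shot labelled instances per class; the rest are held out for evaluation."""
    labelled = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n_shot, replace=False)
        for c in np.unique(labels)
    ])
    held_out = np.setdiff1d(np.arange(len(labels)), labelled)
    return labelled, held_out

def nearest_mean_accuracy(X, labels, labelled, held_out):
    """Stand-in classifier (NOT TMV-HLP): assign each held-out point to the nearest class mean."""
    classes = np.unique(labels[labelled])
    means = np.stack([X[labelled][labels[labelled] == c].mean(0) for c in classes])
    d2 = ((X[held_out][:, None, :] - means[None]) ** 2).sum(-1)
    return (classes[d2.argmin(1)] == labels[held_out]).mean()

# Toy data: three separated classes with 30 instances each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 30)
X = rng.normal(size=(90, 5)) + 4.0 * np.eye(3, 5)[labels]

# 10 rounds of 3-shot evaluation, averaged.
accs = [nearest_mean_accuracy(X, labels, *n_shot_split(labels, 3, rng))
        for _ in range(10)]
mean_acc = float(np.mean(accs))
```

In the "+" condition a class prototype would additionally be added to the labelled pool; that step is omitted here because it depends on the semantic representation in use.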